1. Overview
The FAIR Data Principles are a set of guidelines meant to enhance the ability to find, access, integrate, and re-use data. The acronym FAIR stands for Findable, Accessible, Interoperable, and Reusable; each category highlights a critical component of reproducible results and re-usable data. The 15 core principles are spread across the four categories and are described in more detail by the FORCE11 community, a group focused on “Improving Future Research Communication and e-Scholarship”.
“These high-level FAIR Guiding Principles precede implementation choices, and do not suggest any specific technology, standard, or implementation-solution; moreover, the Principles are not, themselves, a standard or a specification. They act as a guide to data publishers and stewards to assist them in evaluating whether their particular implementation choices are rendering their digital research artefacts Findable, Accessible, Interoperable, and Reusable.”
DataONE and the Arctic Data Center (ADC) implement the FAIR principles through 51 checks for individual pieces of information within the relevant metadata record. Each check returns TRUE or FALSE and assesses the presence, length, or content of a metadata field, among other properties. The 51 checks are spread across the four FAIR categories and inform both the overall score and the aggregate score of each category.
FAIR scores are calculated in two ways: overall, across all FAIR categories, and within a single FAIR category. Category scores are calculated with the algorithm below, using the Required (R) and Optional (O) checks for that category. The overall score uses the same algorithm, calculated across all checks.
\[score_{overall} = \frac{R_{pass} + O_{pass}}{R_{pass} + R_{fail} + O_{pass}}\]
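As a minimal sketch of this calculation (the check results below are hypothetical, not actual ADC output), the score can be computed from lists of Required and Optional check results; note that failed Optional checks do not appear in the denominator, so they do not count against the score:

```python
# Sketch of the aggregate FAIR score calculation described above.
# The check results are hypothetical, for illustration only.

def fair_score(required, optional):
    """Score = (R_pass + O_pass) / (R_pass + R_fail + O_pass).

    `required` and `optional` are lists of booleans (TRUE/FALSE
    check results). Failed Optional checks are excluded from the
    denominator, so they do not lower the score.
    """
    r_pass = sum(required)
    r_fail = len(required) - r_pass
    o_pass = sum(optional)
    denominator = r_pass + r_fail + o_pass
    return 0.0 if denominator == 0 else (r_pass + o_pass) / denominator

# Hypothetical check results for one metadata record:
required = [True, True, False]  # e.g. title, abstract, contact checks
optional = [True, False]        # e.g. keyword, funding checks
print(fair_score(required, optional))  # (2 + 1) / (2 + 1 + 1) = 0.75
```

The same function covers both cases: a category score passes in only that category's checks, while the overall score passes in all 51.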
In the report below, we examine trends in the four aggregate FAIR scores and in the individual FAIR checks. We also evaluate the effect of data curation by the ADC’s Data Team on metadata quality and the associated FAIR metrics. Additional analyses examine the relationship between FAIR scores and dataset views, as well as trends associated with failing FAIR checks.
These analyses show that:
- all aggregate FAIR scores increase over time,
- the major FAIR improvements occur in Accessibility, Interoperability, and Re-usability,
- Data Team curation increases FAIR scores across all categories,
- data curation has improved over time, increasingly raising the metadata quality of data packages,
- FAIR scores for initial submissions continue to increase over time,
- data curation improves 34 of the individual metadata checks.
2. FAIR Scores Over Time
All aggregate FAIR scores have increased since the Arctic Data Center opened on March 21, 2016. This is true of the Overall score as well as each of the categories that make up FAIR: Findability, Accessibility, Interoperability, and Re-usability.
The gains in the Overall FAIR score are driven by large improvements in the Accessibility, Interoperability, and Re-usability scores. These three categories have seen the largest gains since the opening of the ADC; each of their scores has increased by ~0.5.
The Advanced Cooperative Arctic Data and Information Service (ACADIS) is the precursor to the ADC. ACADIS maintained data archive infrastructure and provided services to support projects funded by the Office of Polar Programs. The ADC inherited 3683 datasets from ACADIS, which are still held in the repository today (noted on the upper-left side of Figure 2-1, above). ACADIS data comes from two sources: the ACADIS Gateway (~2500 datasets) and the Earth Observing Laboratory (500-1000 datasets).
3. Data Team Curation Practices
All data submissions are reviewed by the ADC’s Data Team, which then works with the submitter to resolve any issues, including adding information to multiple metadata fields. These data curation steps are reflected in a data package’s FAIR scores, which increase substantially between a package’s initial submission and final publication.
FAIR scores improve through Data Team curation in all categories, including the Overall score. Note that the FAIR categories with the greatest change after curation – Accessibility, Interoperability, and Re-usability – are also those that have improved most since the opening of the ADC.
Previous analyses have shown that the average FAIR score has increased since the ADC’s inception. These improvements come from two areas: improved FAIR scores of initial submissions, and curation processes that are increasingly able to improve metadata quality of data package submissions.
The Data Team’s increasing ability to improve metadata quality is best showcased by the widening gap over time between initial FAIR scores and those of the final data package. This is particularly true for the Interoperable and Re-usable categories.
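That widening gap can be quantified by subtracting the mean initial-submission score from the mean final-version score for each year. The sketch below uses invented yearly averages, not the ADC’s actual data:

```python
# Hypothetical mean Interoperable scores by year; real values would
# come from the ADC's quality reports for initial vs. final versions.
initial_by_year = {2016: 0.40, 2018: 0.48, 2020: 0.55}
final_by_year = {2016: 0.52, 2018: 0.70, 2020: 0.88}

# Curation gap = final minus initial. A gap that grows year over year
# indicates curation adds increasingly more to metadata quality.
gap = {yr: round(final_by_year[yr] - initial_by_year[yr], 2)
       for yr in sorted(initial_by_year)}
print(gap)  # {2016: 0.12, 2018: 0.22, 2020: 0.33}
```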
4. Individual FAIR Checks
There are 51 individual checks for information within metadata records that inform the calculation of FAIR scores. These individual checks are divided among the four FAIR categories of Findable, Accessible, Interoperable, and Reusable. The following figures examine the change in FAIR scores between the initial and final versions of data packages; these analyses are similar to the one conducted on the four higher-level FAIR categories in Figure 3-1.
Of the 51 individual checks shown above, 34 improve during data curation and 4 are nearly perfect at initial submission. This leaves 13 checks to target in future process improvements, or to address through FAIR package updates if the information is stored outside of the EML metadata. The increases in the overall FAIR score, and within each category, noted in Figure 2-1 result from improvements to the individual checks seen here.
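The grouping above (improved / near perfect at submission / remaining targets) can be sketched by comparing per-check pass rates at initial submission and final publication. The check names and rates below are invented for illustration, standing in for the real 51 checks:

```python
# Classify checks by comparing pass rates at initial submission vs.
# final publication. Check names and rates are hypothetical.
checks = {
    "entity_present":      {"initial": 0.30, "final": 0.95},
    "abstract_length":     {"initial": 0.60, "final": 0.90},
    "title_present":       {"initial": 0.99, "final": 0.99},  # near perfect already
    "external_identifier": {"initial": 0.20, "final": 0.20},  # untouched by curation
}

# Improved: curation raises the pass rate.
improved = [n for n, c in checks.items() if c["final"] > c["initial"]]
# Near perfect: already passing for almost all initial submissions.
near_perfect = [n for n, c in checks.items() if c["initial"] >= 0.95]
# Remaining: neither improved nor near perfect -- future targets.
remaining = [n for n, c in checks.items()
             if c["final"] <= c["initial"] and c["initial"] < 0.95]

print(improved)      # ['entity_present', 'abstract_length']
print(near_perfect)  # ['title_present']
print(remaining)     # ['external_identifier']
```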
When plotted alongside data from before the opening of the ADC, additional value beyond Data Team curation becomes visible. Initial submissions made directly to the ADC since 2016-03-21 regularly have higher FAIR scores than the final versions of pre-ADC data. This is particularly true of individual checks within the Interoperable and Re-usable categories, demonstrating value added by the ADC’s ecosystem beyond the significant value added by the Data Team.
5. gganimate
Here is an animated version of Figure 2-1. A lower-resolution version can also be created with the source script. Note that this figure cannot currently be generated on Aurora (as of 2020-11-18).